A Minicourse on Dynamic Multithreaded Algorithms



This tutorial teaches dynamic multithreaded algorithms using a Cilk-like [11, 8, 10] model.
The material was taught in the MIT undergraduate class 6.046 Introduction to Algorithms as 
two 80-minute lectures. The style of the lecture notes follows that of the textbook by Cormen, 
Leiserson, Rivest, and Stein [7], but the pseudocode from that textbook has been "Cilkified" 
to allow it to describe multithreaded algorithms. The first lecture teaches the basics behind 
multithreading, including defining the measures of work and critical-path length. It culminates 
in the greedy scheduling theorem due to Graham and Brent [9, 6]. The second lecture shows 
how parallel applications, including matrix multiplication and sorting, can be analyzed using
divide-and-conquer recurrences. 

1 Dynamic multithreaded programming 

As multiprocessor systems have become increasingly available, interest has grown in parallel
programming. Multithreaded programming is a programming paradigm in which a single program
is broken into multiple threads of control which interact to solve a single problem. These notes
provide an introduction to the analysis of "dynamic" multithreaded algorithms, where threads can
be created and destroyed as easily as an ordinary subroutine can be called and return.



Our model of dynamic multithreaded computation is based on the procedure abstraction found in 
virtually any programming language. As an example, the procedure FIB gives a multithreaded
algorithm for computing the Fibonacci numbers:¹

FIB(n)
1  if n < 2
2     then return n
3  x ← spawn FIB(n − 1)
4  y ← spawn FIB(n − 2)
5  sync
6  return (x + y)

*Support was provided in part by the Defense Advanced Research Projects Agency (DARPA) under Grant
F30602-97-1-0270, by the National Science Foundation under Grants EIA-9975036 and ACI-0324974, and by
the Singapore-MIT Alliance.

¹This algorithm is a terrible way to compute Fibonacci numbers, since it runs in exponential time when
logarithmic methods are known [7, pp. 902-903], but it serves as a good didactic example.

A spawn is the parallel analog of an ordinary subroutine call. The keyword spawn before the
subroutine call in line 3 indicates that the subprocedure FIB(n − 1) can execute in parallel with
the procedure FIB(n) itself. Unlike an ordinary function call, however, where the parent is not
resumed until after its child returns, in the case of a spawn, the parent can continue to execute in
parallel with the child. In this case, the parent goes on to spawn FIB(n − 2). In general, the parent
can continue to spawn off children, producing a high degree of parallelism.

A procedure cannot safely use the return values of the children it has spawned until it executes 
a sync statement. If any of its children have not completed when it executes a sync, the procedure 
suspends and does not resume until all of its children have completed. When all of its children 
return, execution of the procedure resumes at the point immediately following the sync statement. 
In the Fibonacci example, the sync statement in line 5 is required before the return statement 
in line 6 to avoid the anomaly that would occur if x and y were summed before each had been 
computed. 
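The spawn/sync discipline can be imitated with ordinary threads. The following Python sketch is only an illustration and is not Cilk: it assumes that creating one operating-system thread per spawn is acceptable (so n must stay small), and the helper name spawn and its result holder are hypothetical, introduced here for the example.

```python
import threading

def spawn(func, *args):
    """Start func(*args) in a new thread; return (thread, result holder)."""
    box = {}
    def run():
        box["value"] = func(*args)
    t = threading.Thread(target=run)
    t.start()
    return t, box

def fib(n):
    if n < 2:
        return n
    tx, x = spawn(fib, n - 1)        # line 3: child may run in parallel with the parent
    ty, y = spawn(fib, n - 2)        # line 4: the parent continues and spawns again
    tx.join()                        # line 5: sync, wait for every spawned child
    ty.join()
    return x["value"] + y["value"]   # line 6: the values are safe to read only after the sync

print(fib(10))   # prints 55
```

Reading x["value"] before the joins would be the anomaly described above: the child might not have stored its result yet.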

The spawn and sync keywords specify logical parallelism, not "actual" parallelism. That is, 
these keywords indicate which code may possibly execute in parallel, but what actually runs in 
parallel is determined by a scheduler, which maps the dynamically unfolding computation onto 
the available processors. 

We can view a multithreaded computation in graph-theoretic terms as a dynamically unfolding
dag G = (V, E), as is shown in Figure 1 for FIB. We define a thread to be a maximal sequence
of instructions not containing the parallel control statements spawn, sync, and return. Threads
make up the set V of vertices of the multithreaded computation dag G. Each procedure execution is
a linear chain of threads, each of which is connected to its successor in the chain by a continuation
edge. When a thread u spawns a thread v, the dag contains a spawn edge (u, v) ∈ E, as well
as a continuation edge from u to u's successor in the procedure. When a thread u returns, the
dag contains an edge (u, v), where v is the thread that immediately follows the next sync in the
parent procedure. Every computation starts with a single initial thread and (assuming that the
computation terminates) ends with a single final thread. Since the procedures are organized in a
tree hierarchy, we can view the computation as a dag of threads embedded in the tree of procedures.

1.2 Performance Measures 

Two performance measures suffice to gauge the theoretical efficiency of multithreaded algorithms.
We define the work of a multithreaded computation to be the total time to execute all the operations
in the computation on one processor. We define the critical-path length of a computation to be
the longest time to execute the threads along any path of dependencies in the dag. Consider, for
example, the computation in Figure 1. Suppose that every thread can be executed in unit time.
Then, the work of the computation is 17, and the critical-path length is 8.

Figure 1: A dag representing the multithreaded computation of FIB(4). Threads are shown as circles, and
each group of threads belonging to the same procedure is surrounded by a rounded rectangle. Downward
edges are spawn dependencies, horizontal edges represent continuation dependencies within a procedure,
and upward edges are return dependencies.
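The two numbers quoted for FIB(4) can be reproduced with a pair of recurrences. The sketch below rests on an assumption about how the dag is drawn: each call with n ≥ 2 consists of three unit-time threads (one ending at each spawn and one following the sync), and each base-case call is a single thread. The function names work and span are chosen for this illustration.

```python
def work(n):
    """Number of unit-time threads executed by FIB(n)."""
    if n < 2:
        return 1
    return 3 + work(n - 1) + work(n - 2)

def span(n):
    """Longest chain of dependent unit-time threads in FIB(n)."""
    if n < 2:
        return 1
    via_first_child = 1 + span(n - 1) + 1    # thread 1, child FIB(n-1), thread 3
    via_second_child = 2 + span(n - 2) + 1   # threads 1 and 2, child FIB(n-2), thread 3
    return max(via_first_child, via_second_child)

print(work(4), span(4))   # prints 17 8
```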

When a multithreaded computation is executed on a given number P of processors, its running
time depends on how efficiently the underlying scheduler can execute it. Denote by T_P the running
time of a given computation on P processors. Then, the work of the computation can be viewed
as T_1, and the critical-path length can be viewed as T_∞.

The work and critical-path length can be used to provide lower bounds on the running time on
P processors. We have

    T_P ≥ T_1/P ,    (1)

since in one step, a P-processor computer can do at most P work. We also have

    T_P ≥ T_∞ ,    (2)

since a P-processor computer can do no more work in one step than an infinite-processor computer.

The speedup of a computation on P processors is the ratio T_1/T_P, which indicates how many
times faster the P-processor execution is than a one-processor execution. If T_1/T_P = Θ(P), then
we say that the P-processor execution exhibits linear speedup. The maximum possible speedup is
T_1/T_∞, which is also called the parallelism of the computation, because it represents the average
amount of work that can be done in parallel for each step along the critical path. We denote the
parallelism of a computation by P̄ (to distinguish it from the number P of processors).
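As a small worked illustration of Inequalities (1) and (2), the following Python sketch evaluates the lower bound and the parallelism for the FIB(4) numbers quoted earlier (work 17, critical-path length 8). The function names are hypothetical.

```python
def lower_bound(t1, t_inf, p):
    """Best possible running time on p processors by Inequalities (1) and (2)."""
    return max(t1 / p, t_inf)

def parallelism(t1, t_inf):
    """Maximum possible speedup T_1 / T_inf."""
    return t1 / t_inf

t1, t_inf = 17, 8
print(parallelism(t1, t_inf))    # prints 2.125
for p in (1, 2, 4):
    print(p, lower_bound(t1, t_inf, p))
```

Beyond about 2 processors the critical path, not the work, limits this computation.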



1.3 Greedy Scheduling 

The programmer of a multithreaded application has the ability to control the work and critical-path
length of his application, but he has no direct control over the scheduling of his application on a
given number of processors. It is up to the runtime scheduler to map the dynamically unfolding
computation onto the available processors so that the computation executes efficiently. Good
on-line schedulers are known [3, 4, 5], but their analysis is complicated. For simplicity, we'll illustrate
the principles behind these schedulers using an off-line "greedy" scheduler.

A greedy scheduler schedules as much as it can at every time step. On a P-processor computer, 
time steps can be classified into two types. If there are P or more threads ready to execute, the step 
is a complete step, and the scheduler executes any P threads of those ready to execute. If there are 
fewer than P threads ready to execute, the step is an incomplete step, and the scheduler executes 
all of them. This greedy strategy is provably good. 

Theorem 1 (Graham [9], Brent [6]) A greedy scheduler executes any multithreaded computation
G with work T_1 and critical-path length T_∞ in time

    T_P ≤ T_1/P + T_∞    (3)

on a computer with P processors.

Proof. For each complete step, P work is done by the P processors. Thus, the number of complete
steps is at most T_1/P, because after T_1/P such steps, all the work in the computation has been
performed. Now, consider an incomplete step, and consider the subdag G′ of G that remains to be
executed. Without loss of generality, we can view each of the threads as executing in unit time, since
we can replace a longer thread with a chain of unit-time threads. Every thread with in-degree 0 in G′ is
ready to be executed, since all of its predecessors have already executed. By the greedy scheduling
policy, all such threads are executed, since there are strictly fewer than P such threads. Every longest
path in G′ begins at a thread with in-degree 0, and thus the critical-path length of G′ is reduced by 1.
Since the critical-path length of the subdag remaining to be executed decreases by 1 for each
incomplete step, the number of incomplete steps is at most T_∞. Each step is either complete or
incomplete, and hence Inequality (3) follows. □
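The theorem can be checked experimentally on a small dag. The Python sketch below builds a dag for FIB(n) and runs a greedy scheduler on it. It assumes unit-time threads and the same three-threads-per-call reading of Figure 1 (one thread per base-case call); fib_dag, critical_path, and greedy_steps are names invented for this illustration.

```python
def fib_dag(n):
    """Return the dag of FIB(n) as {thread: [successor threads]}."""
    succ = {}
    def new_thread():
        t = len(succ)
        succ[t] = []
        return t
    def build(k):
        """Add the threads of FIB(k); return its (first, last) thread."""
        if k < 2:
            t = new_thread()
            return t, t
        t1, t2, t3 = new_thread(), new_thread(), new_thread()
        first_a, last_a = build(k - 1)
        first_b, last_b = build(k - 2)
        succ[t1] += [first_a, t2]    # spawn edge, continuation edge
        succ[t2] += [first_b, t3]    # spawn edge, continuation edge
        succ[last_a].append(t3)      # return edges to the thread after the sync
        succ[last_b].append(t3)
        return t1, t3
    build(n)
    return succ

def critical_path(succ):
    """Number of threads on a longest path of the dag."""
    memo = {}
    def longest(u):
        if u not in memo:
            memo[u] = 1 + max((longest(v) for v in succ[u]), default=0)
        return memo[u]
    return max(longest(u) for u in succ)

def greedy_steps(succ, p):
    """Number of time steps a greedy scheduler needs on p processors."""
    indeg = {u: 0 for u in succ}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    ready = [u for u in succ if indeg[u] == 0]
    steps = 0
    while ready:
        run, ready = ready[:p], ready[p:]   # complete step if len(run) == p
        steps += 1
        for u in run:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)         # becomes ready for the next step
    return steps

dag = fib_dag(4)
t1, t_inf = len(dag), critical_path(dag)
for p in (1, 2, 4):
    print(p, greedy_steps(dag, p), "<=", t1 / p + t_inf)
```

Which ready threads are chosen in a complete step is arbitrary here (list order); the bound holds for any choice.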

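Corollary 2 A greedy scheduler achieves linear speedup when P̄ = Ω(P).

Proof. Since P̄ = T_1/T_∞, we have P = O(T_1/T_∞), or equivalently, that T_∞ = O(T_1/P). Thus,
we have T_P ≤ T_1/P + T_∞ = O(T_1/P). Together with the lower bound (1), this gives T_P = Θ(T_1/P),
so that the speedup T_1/T_P is Θ(P). □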

1.4 Cilk and ★Socrates 

Cilk [4, 11, 10] is a parallel, multithreaded language based on the serial programming language C. 

Instrumentation in the Cilk scheduler provides an accurate measure of work and critical path. Cilk's
randomized scheduler provably executes a multithreaded computation on a P-processor computer
in T_P = T_1/P + O(T_∞) expected time. Empirically, the scheduler achieves T_P ≈ T_1/P + T_∞
time, yielding near-perfect linear speedup if P ≪ P̄.

Among the applications that have been programmed in Cilk are the ★Socrates and Cilkchess
chess-playing programs. These programs have won numerous prizes in international competition
and are considered to be among the strongest in the world. An interesting anomaly occurred
during the development of ★Socrates which was resolved by understanding the measures of work
and critical-path length.

The ★Socrates program was initially developed on a 32-processor computer at MIT, but it was
intended to run on a 512-processor computer at the National Center for Supercomputing
Applications (NCSA) at the University of Illinois. A clever optimization was proposed which, during
testing at MIT, caused the program to run much faster than the original program. Nevertheless, the
optimization was abandoned, because an analysis of work and critical-path length indicated that
the program would actually be slower on the NCSA machine.

Let us examine this anomaly in more detail. The timing numbers below have been simplified
for exposition. The original program ran in T_32 = 65 seconds at MIT on 32 processors. The
"optimized" program ran in T′_32 = 40 seconds, also on 32 processors. The original program had
work T_1 = 2048 seconds and critical-path length T_∞ = 1 second. Using the formula T_P =
T_1/P + T_∞ as a good approximation of runtime, we discover that indeed T_32 = 65 = 2048/32 + 1.
The "optimized" program had work T′_1 = 1024 seconds and critical-path length T′_∞ = 8 seconds,
yielding T′_32 = 40 = 1024/32 + 8. But now let us determine the runtimes on 512 processors.
We have T_512 = 2048/512 + 1 = 5 and T′_512 = 1024/512 + 8 = 10, which is twice as slow!
Thus, by using work and critical-path length, we can predict the performance of a multithreaded
computation.
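The arithmetic of this example is easy to replay. The Python sketch below applies the approximation T_P = T_1/P + T_∞ to the simplified numbers from the text; the function name predicted_time is hypothetical.

```python
def predicted_time(work, span, p):
    """Approximate running time T_P = T_1/P + T_inf, in seconds."""
    return work / p + span

original = (2048, 1)     # (T_1, T_inf) of the original program
optimized = (1024, 8)    # (T_1, T_inf) of the "optimized" program

for p in (32, 512):
    print(p, predicted_time(*original, p), predicted_time(*optimized, p))
# prints:
# 32 65.0 40.0
# 512 5.0 10.0
```

Halving the work helped at 32 processors, where the T_1/P term dominates, but the eightfold longer critical path dominates at 512 processors.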

Exercise 1-1. Sketch the multithreaded computation that results from executing FIB(5). Assume
that all threads in the computation execute in unit time. What is the work of the computation?
What is the critical-path length? Show how to schedule the dag on 2 processors in a greedy fashion
by labeling each thread with the time step on which it executes.

Exercise 1-2. Consider the following multithreaded procedure SUM for pairwise adding the
elements of arrays A[1 . . n] and B[1 . . n] and storing the sums in C[1 . . n]:

SUM(A, B, C)
1  for i ← 1 to length[A]
2      do C[i] ← spawn ADD(A[i], B[i])
3  sync

ADD(x, y)
1  return (x + y)

Determine an asymptotic bound on the work, the critical-path length, and the parallelism of the
computation in terms of n. Give a divide-and-conquer algorithm for the problem that is as parallel
as possible. Analyze your algorithm.

Exercise 1-3. Prove that a greedy scheduler achieves the stronger bound

    T_P ≤ (T_1 − T_∞)/P + T_∞ .    (4)

Exercise 1-4. Prove that the time for a greedy scheduler to execute any multithreaded
computation is within a factor of 2 of the time required by an optimal scheduler.

Exercise 1-5. For what number P of processors do the two chess programs described in this
section run equally fast?

Exercise 1-6. Professor Tweed takes some measurements of his (deterministic) multithreaded
program, which is scheduled using a greedy scheduler, and finds that T_4 = 80 seconds and
T_64 = 10 seconds. What is the fastest that the professor's computation could possibly run on
10 processors? Use Inequality (4) and the two lower bounds from Inequalities (1) and (2) to derive
your answer.

2 Analysis of multithreaded algorithms 

We now turn to the design and analysis of multithreaded algorithms. Because of the divide-and- 
conquer nature of the multithreaded model, recurrences are a natural way to express the work 
and critical-path length of a multithreaded algorithm. We shall investigate algorithms for matrix 
multiplication and sorting and analyze their performance. 

2.1 Parallel Matrix Multiplication 

To multiply two n × n matrices A and B in parallel to produce a matrix C, we can recursively
formulate the problem as follows:

    ( C11  C12 )     ( A11  A12 )   ( B11  B12 )
    ( C21  C22 )  =  ( A21  A22 ) · ( B21  B22 )

                     ( A11·B11 + A12·B21    A11·B12 + A12·B22 )
                  =  ( A21·B11 + A22·B21    A21·B12 + A22·B22 ) .

Thus, each n × n matrix multiplication can be expressed as 8 multiplications and 4 additions of
(n/2) × (n/2) submatrices. The multithreaded procedure MULT multiplies two n × n matrices,
where n is a power of 2, using an auxiliary procedure ADD to add n × n matrices. This algorithm
is not in-place.

ADD(C, T, n)
1  if n = 1
2     then C[1, 1] ← C[1, 1] + T[1, 1]
3          return
4  partition C and T into (n/2) × (n/2) submatrices
5  spawn ADD(C11, T11, n/2)
6  spawn ADD(C12, T12, n/2)
7  spawn ADD(C21, T21, n/2)
8  spawn ADD(C22, T22, n/2)
9  sync
10 return




MULT(C, A, B, n)
1  if n = 1
2     then C[1, 1] ← A[1, 1] · B[1, 1]
3          return
4  allocate a temporary matrix T[1 . . n, 1 . . n]
5  partition A, B, C, and T into (n/2) × (n/2) submatrices
6  spawn MULT(C11, A11, B11, n/2)
7  spawn MULT(C12, A11, B12, n/2)
8  spawn MULT(C21, A21, B11, n/2)
9  spawn MULT(C22, A21, B12, n/2)
10 spawn MULT(T11, A12, B21, n/2)
11 spawn MULT(T12, A12, B22, n/2)
12 spawn MULT(T21, A22, B21, n/2)
13 spawn MULT(T22, A22, B22, n/2)
14 sync
15 ADD(C, T, n)

The matrix partitionings in line 5 of MULT and line 4 of ADD take Θ(1) time, since only a constant
number of indexing operations are required.
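For readers who want to trace the recursion, here is a serial Python rendering of MULT and ADD in which every spawn is replaced by an ordinary call and every sync by a comment. It is a sketch, not the authors' code: matrices are lists of lists, and a submatrix is represented by the (row, column) offset of its top-left corner, which is one way to make the partitioning a constant-time indexing operation. The offset parameters c, a, b, and t are assumptions of this illustration.

```python
def add(C, T, n, c=(0, 0), t=(0, 0)):
    """C <- C + T on the n x n blocks whose top-left corners are c and t."""
    if n == 1:
        C[c[0]][c[1]] += T[t[0]][t[1]]
        return
    h = n // 2
    for di in (0, h):
        for dj in (0, h):                    # each of these 4 calls would be spawned
            add(C, T, h, (c[0] + di, c[1] + dj), (t[0] + di, t[1] + dj))
    # sync

def mult(C, A, B, n, c=(0, 0), a=(0, 0), b=(0, 0)):
    """C <- A * B on n x n blocks (n a power of 2), overwriting the C block."""
    if n == 1:
        C[c[0]][c[1]] = A[a[0]][a[1]] * B[b[0]][b[1]]
        return
    h = n // 2
    T = [[0] * n for _ in range(n)]          # temporary matrix (line 4 of MULT)
    for i in (0, h):
        for j in (0, h):                     # all 8 calls would be spawned
            # C_ij <- A_i1 * B_1j
            mult(C, A, B, h, (c[0] + i, c[1] + j), (a[0] + i, a[1]), (b[0], b[1] + j))
            # T_ij <- A_i2 * B_2j
            mult(T, A, B, h, (i, j), (a[0] + i, a[1] + h), (b[0] + h, b[1] + j))
    # sync
    add(C, T, n, c, (0, 0))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
mult(C, A, B, 2)
print(C)   # prints [[19, 22], [43, 50]]
```

The eight recursive calls write to disjoint blocks of C and T, which is why they may all run in parallel before the sync.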

To analyze this algorithm, let A_P(n) be the P-processor running time of ADD on n × n matrices,
and let M_P(n) be the P-processor running time of MULT on n × n matrices. The work (running
time on one processor) for ADD can be expressed by the recurrence

    A_1(n) = 4A_1(n/2) + Θ(1)
           = Θ(n²) ,

which is the same as for the ordinary double-nested-loop serial algorithm. Since the spawned
procedures can be executed in parallel, the critical-path length for ADD is

    A_∞(n) = A_∞(n/2) + Θ(1)
           = Θ(lg n) .

The work for MULT can be expressed by the recurrence

    M_1(n) = 8M_1(n/2) + A_1(n)
           = 8M_1(n/2) + Θ(n²)
           = Θ(n³) ,

which is the same as for the ordinary triple-nested-loop serial algorithm. The critical-path length
for MULT is

    M_∞(n) = M_∞(n/2) + Θ(lg n)
           = Θ(lg² n) .
Thus, the parallelism for MULT is M_1(n)/M_∞(n) = Θ(n³/lg² n), which is quite high. To multiply
1000 × 1000 matrices, for example, the parallelism is (ignoring constants) about 1000³/10² = 10⁷.
Most parallel computers have far fewer processors.

To achieve high performance, it is often advantageous for an algorithm to use less space,
because more space usually means more time. For the matrix-multiplication problem, we can
eliminate the temporary matrix T in exchange for reducing the parallelism. Our new algorithm
MULT-ADD performs C ← C + A · B using a similar divide-and-conquer strategy to MULT.

MULT-ADD(C, A, B, n)
1  if n = 1
2     then C[1, 1] ← C[1, 1] + A[1, 1] · B[1, 1]
3          return
4  partition A, B, and C into (n/2) × (n/2) submatrices
5  spawn MULT-ADD(C11, A11, B11, n/2)
6  spawn MULT-ADD(C12, A11, B12, n/2)
7  spawn MULT-ADD(C21, A21, B11, n/2)
8  spawn MULT-ADD(C22, A21, B12, n/2)
9  sync
10 spawn MULT-ADD(C11, A12, B21, n/2)
11 spawn MULT-ADD(C12, A12, B22, n/2)
12 spawn MULT-ADD(C21, A22, B21, n/2)
13 spawn MULT-ADD(C22, A22, B22, n/2)
14 sync
15 return

Let MA_P(n) be the P-processor running time of MULT-ADD on n × n matrices. The work for
MULT-ADD is MA_1(n) = Θ(n³), following the same analysis as for MULT, but the critical-path
length is now

    MA_∞(n) = 2MA_∞(n/2) + Θ(1)
            = Θ(n) ,

since only 4 recursive calls can be executed in parallel: the second group of 4 must wait at the
sync in line 9 for the first group to finish.

Thus, the parallelism is MA_1(n)/MA_∞(n) = Θ(n²). On 1000 × 1000 matrices, for example, the
parallelism is (ignoring constants) still quite high: about 1000² = 10⁶. In practice, this algorithm
often runs somewhat faster than the first, since saving space often saves time due to hierarchical
memory.
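To see the two parallelism estimates side by side, the recurrences can be evaluated numerically. The Python sketch below assumes, purely for illustration, that every Θ(1) term costs exactly 1, that the Θ(n²) term in the work of MULT costs exactly n², and that n is a power of 2; the function names are hypothetical.

```python
def add_span(n):
    """A_inf(n) = A_inf(n/2) + 1."""
    return 1 if n == 1 else add_span(n // 2) + 1

def mult_work(n):
    """M_1(n) = 8 M_1(n/2) + n^2 (the n^2 term is the work of ADD)."""
    return 1 if n == 1 else 8 * mult_work(n // 2) + n * n

def mult_span(n):
    """M_inf(n) = M_inf(n/2) + A_inf(n)."""
    return 1 if n == 1 else mult_span(n // 2) + add_span(n)

def mult_add_work(n):
    """MA_1(n) = 8 MA_1(n/2) + 1."""
    return 1 if n == 1 else 8 * mult_add_work(n // 2) + 1

def mult_add_span(n):
    """MA_inf(n) = 2 MA_inf(n/2) + 1: two groups of spawns run one after the other."""
    return 1 if n == 1 else 2 * mult_add_span(n // 2) + 1

n = 1024
print("MULT     parallelism:", mult_work(n) / mult_span(n))
print("MULT-ADD parallelism:", mult_add_work(n) / mult_add_span(n))
```

Under these unit-cost assumptions the span of MULT-ADD is exactly 2n − 1, which is the Θ(n) behavior derived above.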
Figure 2: Illustration of P-MERGE. The median of array A is used to partition array B, and then the lower
portions of the two arrays are recursively merged, as, in parallel, are the upper portions.

2.2 Parallel Merge Sort 

This section shows how to parallelize merge sort. We shall see the parallelism of the algorithm 
depends on how well the merge subroutine can be parallelized. 

The most straightforward way to parallelize merge sort is to run the recursion in parallel, as is 
done in the following pseudocode: 

MERGE-SORT(A, p, r)
1  if p < r
2     then q ← ⌊(p + r)/2⌋
3          spawn MERGE-SORT(A, p, q)
4          spawn MERGE-SORT(A, q + 1, r)
5          sync
6          MERGE(A, p, q, r)
7  return



The work of MERGE-SORT on an array of n elements is

    T_1(n) = 2T_1(n/2) + Θ(n)
           = Θ(n lg n) ,

since the running time of MERGE is Θ(n). Since the two recursive spawns operate in parallel, the
critical-path length of MERGE-SORT is

    T_∞(n) = T_∞(n/2) + Θ(n)
           = Θ(n) .

Consequently, the parallelism of the algorithm is T_1(n)/T_∞(n) = Θ(lg n), which is puny. The
obvious bottleneck is MERGE.

The following pseudocode, which is illustrated in Figure 2, performs the merge in parallel. 
P-MERGE(A[1 . . l], B[1 . . m], C[1 . . n])
1  if m > l    ▷ without loss of generality, larger array should be first
2     then P-MERGE(B[1 . . m], A[1 . . l], C[1 . . n])
3          return
4  if n = 1
5     then C[1] ← A[1]
6          return
7  if l = 1    ▷ and m = 1
8     then if A[1] ≤ B[1]
9             then C[1] ← A[1]; C[2] ← B[1]
10            else C[1] ← B[1]; C[2] ← A[1]
11         return
12 find j such that B[j] ≤ A[l/2] ≤ B[j + 1] using binary search
13 spawn P-MERGE(A[1 . . (l/2)], B[1 . . j], C[1 . . (l/2 + j)])
14 spawn P-MERGE(A[(l/2 + 1) . . l], B[(j + 1) . . m], C[(l/2 + j + 1) . . n])
15 sync
16 return

This merging algorithm finds the median of the larger array and uses it to partition the smaller 
array. Then, the lower portions of the two arrays are recursively merged, and in parallel, so are the 
upper portions. 
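The structure of P-MERGE can be tried out with a serial Python sketch in which the two spawns become ordinary calls. It departs from the pseudocode in ways that should be kept in mind: it is 0-indexed, it returns a new list instead of writing into C, and it uses list slices, which copy elements, so this version does not have the linear work of the real algorithm. The name p_merge is hypothetical; bisect_left is the standard-library binary search.

```python
from bisect import bisect_left

def p_merge(A, B):
    """Merge sorted lists A and B by splitting at the median of the larger one."""
    if len(B) > len(A):              # lines 1-3: larger array goes first
        A, B = B, A
    if len(A) == 0:
        return []
    if len(A) == 1:                  # lines 4-11: at most two elements remain
        if not B:
            return A[:]
        return [A[0], B[0]] if A[0] <= B[0] else [B[0], A[0]]
    mid = len(A) // 2                # position of the median of the larger array
    j = bisect_left(B, A[mid])       # line 12: B[:j] < A[mid] <= B[j:]
    lower = p_merge(A[:mid], B[:j])  # line 13: would be spawned
    upper = p_merge(A[mid:], B[j:])  # line 14: would be spawned
    # line 15: sync
    return lower + upper

print(p_merge([1, 4, 6, 9, 12], [2, 3, 10]))   # prints [1, 2, 3, 4, 6, 9, 10, 12]
```

Every element of the lower subproblem is at most A[mid] and every element of the upper subproblem is at least A[mid], so concatenating the two results is correct.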

To analyze P-MERGE, let PM_P(n) be the P-processor time to merge two arrays A and B
having n = l + m elements in total. Without loss of generality, let A be the larger of the two
arrays, that is, assume l ≥ m.

We'll analyze the critical-path length first. The binary search of B takes Θ(lg m) time, which
in the worst case is Θ(lg n). Since the two recursive spawns in lines 13 and 14 operate in parallel,
the worst-case critical-path length is Θ(lg n) plus the worst-case critical-path length of the spawn
operating on the larger subarrays. In the worst case, we must merge half of A with all of B, in
which case the recursive spawn operates on at most 3n/4 elements. Thus, we have

    PM_∞(n) ≤ PM_∞(3n/4) + Θ(lg n)
            = Θ(lg² n) .

To analyze the work of P-MERGE, observe that although the two recursive spawns may operate
on different numbers of elements, they always operate on n elements between them. Let αn
be the number of elements operated on by the first spawn, where α is a constant in the range
1/4 ≤ α ≤ 3/4. Thus, the second spawn operates on (1 − α)n elements, and the worst-case work
satisfies the recurrence

    PM_1(n) = PM_1(αn) + PM_1((1 − α)n) + Θ(lg n) .    (5)

We shall show that PM_1(n) = Θ(n) using the substitution method. (Actually, the Akra-Bazzi
method [2], if you know it, is simpler.) We assume inductively that PM_1(n) ≤ an − b lg n for some
constants a, b > 0. We have

    PM_1(n) ≤ aαn − b lg(αn) + a(1 − α)n − b lg((1 − α)n) + Θ(lg n)
            = an − b(lg(αn) + lg((1 − α)n)) + Θ(lg n)
            = an − b(lg α + lg n + lg(1 − α) + lg n) + Θ(lg n)
            = an − b lg n − (b(lg n + lg(α(1 − α))) − Θ(lg n))
            ≤ an − b lg n ,

since we can choose b large enough so that b(lg n + lg(α(1 − α))) dominates Θ(lg n). Moreover,
we can pick a large enough to satisfy the base conditions. Thus, PM_1(n) = Θ(n), which is the
same work asymptotically as the ordinary, serial merging algorithm.
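A numerical check of recurrence (5) can make the linear bound more tangible. The Python sketch below fixes the most unbalanced split allowed, α = 3/4 (rounded down), charges exactly lg n for the Θ(lg n) term, and charges 1 for a single element. These constants are assumptions made only for the illustration, and pm_work is a hypothetical name.

```python
from functools import lru_cache
from math import log2

@lru_cache(maxsize=None)
def pm_work(n):
    """PM_1(n) = PM_1(k) + PM_1(n - k) + lg n with k = floor(3n/4), PM_1(1) = 1."""
    if n <= 1:
        return 1.0
    k = (3 * n) // 4
    return pm_work(k) + pm_work(n - k) + log2(n)

for e in (4, 8, 12, 16, 20):
    n = 2 ** e
    print(n, pm_work(n) / n)   # the ratio stays bounded as n grows
```

The ratio PM_1(n)/n settling to a constant is what the substitution argument predicts: the Θ(lg n) terms add only a lower-order contribution.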

We can now reanalyze MERGE-SORT using the P-MERGE subroutine. The work T_1(n)
remains the same, but the worst-case critical-path length now satisfies

    T_∞(n) = T_∞(n/2) + Θ(lg² n)
           = Θ(lg³ n) .

The parallelism is now Θ(n lg n)/Θ(lg³ n) = Θ(n/lg² n).

Exercise 2-1. Give an efficient and highly parallel multithreaded algorithm for multiplying an
n × n matrix A by a length-n vector x that achieves work Θ(n²) and critical path Θ(lg n). Analyze
the work and critical-path length of your implementation, and give the parallelism.

Exercise 2-2. Describe a multithreaded algorithm for matrix multiplication that achieves work
Θ(n³) and critical path Θ(lg n). Comment informally on the locality displayed by your algorithm
in the ideal cache model as compared with the two algorithms from this section.

Exercise 2-3. Write a Cilk program to multiply an n_1 × n_2 matrix by an n_2 × n_3 matrix in parallel.
Analyze the work, critical-path length, and parallelism of your implementation. Your algorithm
should be efficient even if any of n_1, n_2, and n_3 are 1.

Exercise 2-4. Write a Cilk program to implement Strassen's matrix multiplication algorithm in 
parallel as efficiently as you can. Analyze the work, critical-path length, and parallelism of your 
implementation. 

Exercise 2-5. Write a Cilk program to invert a symmetric and positive-definite matrix in parallel. 
(Hint: Use a divide-and-conquer approach based on the ideas of Theorem 31.12 from [7].) 

Exercise 2-6. Akl and Santoro [1] have proposed a merging algorithm in which the first step is to
find the median of all the elements in the two sorted input arrays (as opposed to the median of the
elements in the larger subarray, as is done in P-MERGE). Show that if the total number of elements
in the two arrays is n, this median can be found using Θ(lg n) time on one processor in the worst
case. Describe a linear-work multithreaded merging algorithm based on this subroutine that has a
parallelism of Θ(n/lg² n). Give and solve the recurrences for work and critical-path length, and
determine the parallelism. Implement your algorithm as a Cilk program.

Exercise 2-7. Generalize the algorithm from Exercise 2-6 to find arbitrary order statistics.
Describe a merge-sorting algorithm with Θ(n lg n) work that achieves a parallelism of Θ(n/lg n).
(Hint: Merge many subarrays in parallel.)

Exercise 2-8. The length of a longest common subsequence of two length-n sequences x and y
can be computed in parallel using a divide-and-conquer multithreaded algorithm. Denote by c[i, j]
the length of a longest common subsequence of x[1 . . i] and y[1 . . j]. First, the multithreaded
algorithm recursively computes c[i, j] for all i in the range 1 ≤ i ≤ n/2 and all j in the range
1 ≤ j ≤ n/2. Then, it recursively computes c[i, j] for 1 ≤ i ≤ n/2 and n/2 < j ≤ n, while in
parallel recursively computing c[i, j] for n/2 < i ≤ n and 1 ≤ j ≤ n/2. Finally, it recursively
computes c[i, j] for n/2 < i ≤ n and n/2 < j ≤ n. For the base case, the algorithm computes
c[i, j] in terms of c[i − 1, j − 1], c[i − 1, j], and c[i, j − 1] in the ordinary way, since the logic of
the algorithm guarantees that these three values have already been computed.

That is, if the dynamic programming tableau is broken into four pieces

    I     II
    III   IV

then the recursive multithreaded code would look something like this:

I
spawn II
spawn III
sync
IV
return



Analyze the work, critical-path length, and parallelism of this algorithm. Describe and analyze 
an algorithm that is asymptotically as efficient (same work) but more parallel. Make whatever 
interesting observations you can. Write an efficient Cilk program for the problem. 